This is an introductory project using the house prices dataset from Kaggle.com.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import log
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
import math
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
train_df = pd.read_csv('train.csv')
train_df
Below is some information about the dataset that will help during the preprocessing phase (phase 1).
train_df.info()
train_df.describe()
percentage_of_nonnull = train_df.count() / train_df.shape[0] * 100
percentage_of_nonnull.sort_values()
plt.figure(figsize=(10, 10))
sns.heatmap(train_df.corr(numeric_only=True))
%%time
train_df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
As you can see, OverallQual, GrLivArea, GarageCars, and GarageArea have the highest correlation with SalePrice, so these features are the most likely to have a significant impact on the price of a house.
train_df_pricelog = train_df.copy(True)
train_df_pricelog['SalePrice'] = np.log(train_df_pricelog['SalePrice'])
train_df_pricelog
plt.figure(figsize=(10, 10))
sns.heatmap(train_df_pricelog.corr(numeric_only=True))
train_df_pricelog.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
As you can see, taking the logarithm of the SalePrice column doesn't really change the correlation values, so we conclude that SalePrice doesn't have an exponential relationship with any of these features and the relationship is most likely linear.
The four features mentioned in the previous section are also correlated with each other, for example GarageCars and GarageArea, so one of them can be used and the other omitted.
Here you can see the relationship between the sale price and the other features.
plt.scatter(train_df.SalePrice, train_df.OverallQual)
plt.scatter(train_df.SalePrice, train_df.GrLivArea)
plt.scatter(train_df.SalePrice, train_df.GarageCars)
plt.scatter(train_df.SalePrice, train_df.GarageArea)
plt.hexbin(train_df.SalePrice, train_df.GrLivArea)
plt.hexbin(train_df.SalePrice, train_df.GarageArea)
print("Houses with price less than 100k : ", train_df.loc[train_df.SalePrice < 100000].shape[0] / train_df.shape[0] * 100)
print("Houses with price between 100k and 200k : ", train_df.loc[(train_df.SalePrice < 200000) & (train_df.SalePrice > 100000)].shape[0] / train_df.shape[0] * 100)
print("Houses with price between 200k and 300k : ", train_df.loc[(train_df.SalePrice > 200000) & (train_df.SalePrice < 300000)].shape[0] / train_df.shape[0] * 100)
print("Houses with price more than 300k : ", train_df.loc[(train_df.SalePrice > 300000)].shape[0] / train_df.shape[0] * 100)
plt.scatter(train_df.SalePrice, train_df.OverallCond)
We did some examination to get familiar with the dataset and the distribution of different parameters. Now it's time to start the preprocessing phase for this dataset.
Here we use the median for filling missing data because it's not as sensitive to noise as the mean and also doesn't produce float values for integer features. Other ways to handle missing data are deleting the column or the record completely, which isn't a good option for small datasets or datasets with a high rate of missing data.
We can also use algorithms that support missing data, which saves us from having to handle the missing values ourselves.
train_df = train_df.apply(lambda x: x.fillna(x.median()) if x.dtype != 'O' else x)
train_df.info()
Here we delete columns with very little correlation with SalePrice, and we also delete the Utilities column because most records share the same value for this feature.
print(train_df.Utilities.describe())
preprocessed_train_df = train_df.drop(columns=['GarageYrBlt', '1stFlrSF', 'GarageArea', 'Utilities', 'Id'], axis=1)
col_names = []
for ind, value in preprocessed_train_df.corr(numeric_only=True)['SalePrice'].items():
    if value < 0:
        col_names.append(ind)
print(col_names)
preprocessed_train_df = preprocessed_train_df.drop(columns=col_names)
preprocessed_train_df
As written in the dataset description, NA values can indicate that a house lacks a certain feature, so treating these NA values as missing and simply omitting the feature is not a good idea. We can replace these NA values with 'None' and then deal with the rest of the data.
There are many methods for handling categorical data. We can assign an integer to each label using LabelEncoder.
For data that has an order between its values, we can assign integers that reflect that order between the labels.
We can also use one-hot encoding, which is specifically a better choice for decision tree models.
Here we use the first two methods.
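For completeness, a minimal sketch of the one-hot option with pandas (the column name and labels below are illustrative):

```python
import pandas as pd

# Each label becomes its own 0/1 indicator column, so no artificial ordering is imposed.
df = pd.DataFrame({'GarageType': ['Attchd', 'Detchd', 'None', 'Attchd']})
one_hot = pd.get_dummies(df, columns=['GarageType'])
print(one_hot.columns.tolist())
# ['GarageType_Attchd', 'GarageType_Detchd', 'GarageType_None']
```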
col_names = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
for col in col_names:
    preprocessed_train_df[col].fillna('None', inplace=True)
print(preprocessed_train_df['Electrical'].value_counts())
preprocessed_train_df['Electrical'].fillna(preprocessed_train_df['Electrical'].mode()[0], inplace=True)
print()
print(preprocessed_train_df.MasVnrType.describe())
preprocessed_train_df['MasVnrType'].fillna(preprocessed_train_df['MasVnrType'].mode()[0], inplace=True)
preprocessed_train_df.isna().sum().sum()
numeric_train_df = preprocessed_train_df.copy(deep=True)
print(preprocessed_train_df.LotShape.describe())
shape = {
    'Reg': 3,
    'IR1': 2,
    'IR2': 1,
    'IR3': 0
}
numeric_train_df.LotShape = preprocessed_train_df.LotShape.map(shape)
numeric_train_df.LotShape.describe()
print(preprocessed_train_df.LandSlope.describe())
slope = {
    'Gtl': 3,
    'Mod': 2,
    'Sev': 1
}
numeric_train_df.LandSlope = preprocessed_train_df.LandSlope.map(slope)
numeric_train_df.LandSlope.describe()
quality = {
    'Ex': 5,
    'Gd': 4,
    'TA': 3,
    'Fa': 2,
    'Po': 1,
    'None': 0
}
col_names = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']
for col in col_names:
    numeric_train_df[col] = preprocessed_train_df[col].map(quality)
numeric_train_df.drop('LotFrontage', axis=1, inplace=True)
numeric_train_df.drop('MasVnrArea', axis=1, inplace=True)
numeric_train_df.info()
for col in numeric_train_df.columns:
    if numeric_train_df[col].dtypes == 'object':
        encode = LabelEncoder()
        numeric_train_df[col] = encode.fit_transform(numeric_train_df[col].values)
numeric_train_df.info()
Values of different features have different ranges, and a feature with a greater range can influence the decision process of the model more even though it isn't more important than the other features, so normalizing evens out the effect of each feature on the final result.
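Min-max scaling maps each column to [0, 1] via (x - min) / (max - min); here is a quick sanity check of MinMaxScaler against that formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [3.0], [5.0]])
scaled = MinMaxScaler().fit_transform(x)
manual = (x - x.min()) / (x.max() - x.min())  # the underlying formula
# Both give [[0.0], [0.5], [1.0]]
```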
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(numeric_train_df)
normalized_df = pd.DataFrame(x_scaled)
normalized_df.columns = numeric_train_df.columns
normalized_df.describe()
After changing the type of every column to numeric, we can eliminate the columns which have very little correlation with SalePrice. Doing this will improve the performance of the model and keep only the most important features to decide with.
smaller_df = numeric_train_df.copy()
corrs = numeric_train_df.corr().SalePrice
number = 0
for col in numeric_train_df.columns:
    if corrs[col] < 0.1:
        smaller_df.drop(col, axis=1, inplace=True)
        number += 1
print(number)
smaller_df
There isn't a fixed rule for the value of P (the training proportion), but keeping 75 to 85 percent of the dataset for training is usually a good choice; the value highly depends on the size of the dataset. If the dataset is small, most of the data should be used for training, and if it is very large, keeping the train set small can reduce the cost of training the model.
Since we are going to evaluate our model with a part of the train dataset, it is important not to pick only records with similar features, and random selection eliminates the possibility of ending up with a subset that shares specific features.
train, test = train_test_split(smaller_df, train_size = 0.8)
train_x = train.drop('SalePrice', axis = 1)
train_y = train.SalePrice
test_x = test.drop('SalePrice', axis = 1)
test_y = test.SalePrice
print("Train ratio: ", train_x.shape[0] / numeric_train_df.shape[0] * 100, " %")
Here three different methods, KNN, linear regression, and a decision tree, are used to predict the price of each house in the test dataset.
def get_error(ans, pre):
    mae = MAE(ans, pre)
    rmse = MSE(ans, pre, squared=False)
    return mae, rmse
linear_reg = LinearRegression()
linear_reg.fit(train_x, train_y)
mae, rmse = get_error(test_y, linear_reg.predict(test_x))
print("Linear Regression results: ")
print("MAE: ", mae)
print("RMSE", rmse)
mae_list = []
rmse_list = []
for i in range(1, 100):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(train_x, train_y)
    mae, rmse = get_error(test_y, knn.predict(test_x))
    mae_list.append(mae)
    rmse_list.append(rmse)
x = [i for i in range(1, 100)]
plt.plot(x,mae_list, label='MAE')
plt.plot(x,rmse_list, label='RMSE')
plt.xlabel('number of neighbors')
plt.ylabel('Error')
plt.legend()
plt.show()
knn = KNeighborsRegressor()
param_grid = { 'n_neighbors': [i for i in range(1, 100)] }
optimazer_knn = GridSearchCV(knn, param_grid, n_jobs=-1)
optimazer_knn.fit(train_x, train_y)
mae, rms = get_error(test_y, optimazer_knn.predict(test_x))
print("KNN results: ")
print("MAE: ", mae)
print("RMSE: ", rms)
mae_list = []
rmse_list = []
for i in range(1, 50):
    dt = DecisionTreeRegressor(max_depth=i)
    dt.fit(train_x, train_y)
    mae, rmse = get_error(test_y, dt.predict(test_x))
    mae_list.append(mae)
    rmse_list.append(rmse)
x = [i for i in range(1, 50)]
plt.plot(x,mae_list, label='MAE')
plt.plot(x,rmse_list, label='RMSE')
plt.xlabel('max depth')
plt.ylabel('Error')
plt.legend()
plt.show()
dt = DecisionTreeRegressor()
param_grid = { 'max_depth': [i for i in range(1, 50)] }
optimazer_dt = GridSearchCV(dt, param_grid, n_jobs=-1)
optimazer_dt.fit(train_x, train_y)
mae, rms = get_error(test_y, optimazer_dt.predict(test_x))
print("Decision Tree results: ")
print("MAE: ", mae)
print("RMSE: ", rms)
Overfitting is a condition where the model is trained so specifically on the train dataset that it captures every noisy data point in it, and this degrades the performance of the model when it encounters new data.
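A small synthetic illustration of this (not the house data): an unconstrained decision tree memorizes training noise, giving near-zero train error but a much larger test error:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)  # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)  # no depth limit
train_mse = mean_squared_error(y_tr, deep.predict(X_tr))
test_mse = mean_squared_error(y_te, deep.predict(X_te))
# train_mse is essentially 0 while test_mse reflects the memorized noise.
```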
train_mae, train_rms = get_error(train_y, linear_reg.predict(train_x))
test_mae, test_rms = get_error(test_y, linear_reg.predict(test_x))
print("Linear Regression")
print("train errors: ")
print("\tMAE\t\t\tRMSE")
print("\t", train_mae, "\t", train_rms)
print("test errors: ")
print("\tMAE\t\t\tRMSE")
print("\t", test_mae, "\t", test_rms)
train_mae, train_rms = get_error(train_y, optimazer_knn.predict(train_x))
test_mae, test_rms = get_error(test_y, optimazer_knn.predict(test_x))
print("KNN")
print("train errors: ")
print("\tMAE\t\t\tRMSE")
print("\t", train_mae, "\t", train_rms)
print("test errors: ")
print("\tMAE\t\t\tRMSE")
print("\t", test_mae, "\t", test_rms)
train_mae, train_rms = get_error(train_y, optimazer_dt.predict(train_x))
test_mae, test_rms = get_error(test_y, optimazer_dt.predict(test_x))
print("Decision Tree")
print("train errors: ")
print("\tMAE\t\t\tRMSE")
print("\t", train_mae, "\t", train_rms)
print("test errors: ")
print("\tMAE\t\t\tRMSE")
print("\t", test_mae, "\t", test_rms)
According to the above results, the KNN model might be underfit because its performance isn't good enough even on the train dataset, and the decision tree model could be overfit because the gap between the train and test results is a little larger than it should be.
To test the effect of the preprocessing, I tested the models with numeric_train_df, smaller_df, and normalized_df; the best result came from smaller_df, thanks to the smaller number of features affecting the result and to handling each column as it should be when we converted the categorical features into numeric values.
As the last step, to get the best results out of the previously used models, we combine them using methods such as Random Forest and Voting Regression to minimize the error.
rf = RandomForestRegressor()
param_grid = {
    'max_depth': [i for i in range(3, 10)],
    'n_estimators': [100, 200, 300]
}
optimazer_rf = GridSearchCV(rf, param_grid, n_jobs=-1, verbose=True)
optimazer_rf.fit(train_x, train_y)
mae, rms = get_error(test_y, optimazer_rf.predict(test_x))
print("Random Forest results: ")
print("MAE: ", mae)
print("RMSE: ", rms)
knn = KNeighborsRegressor(n_neighbors=8)
knn.fit(train_x, train_y)
mae, rms = get_error(test_y, knn.predict(test_x))
print("KNN without optimizing: ")
print("MAE: ", mae)
print("RMSE: ", rms)
dt = DecisionTreeRegressor(max_depth=10)
dt.fit(train_x, train_y)
mae, rms = get_error(test_y, dt.predict(test_x))
print("Decision Tree without optimizing: ")
print("MAE: ", mae)
print("RMSE: ", rms)
vr = VotingRegressor([('LG', linear_reg), ('KNN', knn), ('DT', dt)])
vr.fit(train_x, train_y)
mae, rms = get_error(test_y, vr.predict(test_x))
print("Voting Regression results: ")
print("MAE: ", mae)
print("RMSE: ", rms)
The advantage of using ensemble models is that each model has its own strengths and weaknesses, and combining the results of a few models that cover each other's weaknesses can lead to a very accurate model that performs well across different situations.
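Under the hood, an equally weighted VotingRegressor simply averages the members' predictions; a quick check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import VotingRegressor

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

vr = VotingRegressor([
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(max_depth=3, random_state=0)),
]).fit(X, y)

# Fit the same members standalone and average their predictions by hand.
lr = LinearRegression().fit(X, y)
dt = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
manual = (lr.predict(X) + dt.predict(X)) / 2
# manual matches vr.predict(X) element-wise
```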